I really like data visualization. My initial forays into it led me to the package ggplot2, which seems to be quickly becoming the standard in the R community (or at least the one I see most often). When I first started using it, the syntax seemed pretty daunting (all the aes and geom_whatever arguments were confusing to me), but I was motivated enough by some pretty examples and helpful documentation to hack through other people’s code and make my own plots. I slowly started to get the syntax down, and now making plots with ggplot is mostly intuitive and “fun”. I still have to look specific stuff up once in a while, but I usually know what I’m looking for at least. Anyway, I think ggplot2 is great, though the learning curve can feel a bit steep, so here is a simple tutorial.
library(ggplot2)
I’ll use some data I found about musicals because I was recently reminded of my childhood love for musicals (especially Rodgers & Hammerstein).
I took the inflation-adjusted box office numbers and release year of the 27 top grossing musicals from this website and put them in a spreadsheet. (Why did they only include the top 27 musicals? Seems like a strange number to decide on.)
Anyhow, for each musical I typed “[musical name] tracklisting” and “[musical name] runtime” into Google and recorded what came up for each search in my spreadsheet, which can be downloaded on my GitHub (feel free to contribute to the data as well!). There’s probably a way I could have written some code to automatically scrape that data and more, but that would’ve been a whole thing.
Anyway, here is the simple dataset:
| Name | BoxOffice | song_quantity | year | length_minutes |
|---|---|---|---|---|
| Annie Get Your Gun | 127245278 | 31 | 1950 | 107 |
| Dreamgirls | 128941402 | 39 | 2006 | 130 |
| Bye Bye Birdie | 130212874 | 11 | 1963 | 120 |
| Into The Woods | 130894237 | 21 | 2014 | 125 |
| Flower Drum Song | 131191774 | 16 | 1961 | 132 |
| Gypsy | 133397795 | 16 | 1962 | 153 |
| The Rocky Horror Picture Show | 140576117 | 14 | 1975 | 101 |
| Hairspray | 145652571 | 17 | 2007 | 120 |
| Les Miserables | 156592957 | 34 | 2012 | 160 |
| Annie | 163607956 | 16 | 1982 | 128 |
| Gentlemen Prefer Blondes | 168600000 | 6 | 1953 | 91 |
| Mamma Mia | 169222345 | 24 | 2008 | 109 |
| The Music Man | 180087030 | 18 | 1962 | 155 |
| Paint Your Wagon | 188064853 | 15 | 1969 | 164 |
| The Best Little Whorehouse In Texas | 199858769 | 15 | 1982 | 114 |
| Cabaret | 204930552 | 17 | 1972 | 124 |
| Camelot | 218495610 | 12 | 1967 | 179 |
| Chicago | 239112061 | 18 | 2002 | 113 |
| Oliver | 240691792 | 20 | 1968 | 153 |
| The King And I | 359118000 | 21 | 1956 | 144 |
| Funny Girl | 376454194 | 17 | 1968 | 155 |
| Fiddler On The Roof | 408170029 | 17 | 1971 | 181 |
| South Pacific | 456211764 | 16 | 1958 | 171 |
| West Side Story | 533899997 | 15 | 1957 | 153 |
| Grease | 602892685 | 24 | 1978 | 111 |
| My Fair Lady | 652645154 | 15 | 1964 | 175 |
| The Sound Of Music | 1362273686 | 16 | 1965 | 174 |
The variable names are pretty self-explanatory. Across these elite (i.e., truncated range of) musicals, the average number of songs is 18.56 (SD = 6.95) and the average runtime is 138.59 minutes (SD = 26.49).¹
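Since the plotting code in this post refers to a data frame called df without showing how it was loaded, here is a minimal sketch that rebuilds it directly from the table above. Typing the values in by hand is my assumption for the sake of a self-contained example; reading the spreadsheet from a file would work just as well.

```r
# Rebuild the table above as a data frame named `df`
# (the name `df` matches the plotting code used in this post).
df <- data.frame(
  Name = c("Annie Get Your Gun", "Dreamgirls", "Bye Bye Birdie", "Into The Woods",
           "Flower Drum Song", "Gypsy", "The Rocky Horror Picture Show", "Hairspray",
           "Les Miserables", "Annie", "Gentlemen Prefer Blondes", "Mamma Mia",
           "The Music Man", "Paint Your Wagon", "The Best Little Whorehouse In Texas",
           "Cabaret", "Camelot", "Chicago", "Oliver", "The King And I", "Funny Girl",
           "Fiddler On The Roof", "South Pacific", "West Side Story", "Grease",
           "My Fair Lady", "The Sound Of Music"),
  BoxOffice = c(127245278, 128941402, 130212874, 130894237, 131191774, 133397795,
                140576117, 145652571, 156592957, 163607956, 168600000, 169222345,
                180087030, 188064853, 199858769, 204930552, 218495610, 239112061,
                240691792, 359118000, 376454194, 408170029, 456211764, 533899997,
                602892685, 652645154, 1362273686),
  song_quantity = c(31, 39, 11, 21, 16, 16, 14, 17, 34, 16, 6, 24, 18, 15, 15, 17,
                    12, 18, 20, 21, 17, 17, 16, 15, 24, 15, 16),
  year = c(1950, 2006, 1963, 2014, 1961, 1962, 1975, 2007, 2012, 1982, 1953, 2008,
           1962, 1969, 1982, 1972, 1967, 2002, 1968, 1956, 1968, 1971, 1958, 1957,
           1978, 1964, 1965),
  length_minutes = c(107, 130, 120, 125, 132, 153, 101, 120, 160, 128, 91, 109,
                     155, 164, 114, 124, 179, 113, 153, 144, 155, 181, 171, 153,
                     111, 175, 174),
  stringsAsFactors = FALSE
)

# Descriptive statistics of the kind reported in the text
# (sd() computes the sample standard deviation)
round(mean(df$song_quantity), 2)
round(sd(df$song_quantity), 2)
round(mean(df$length_minutes), 2)
round(sd(df$length_minutes), 2)
```

Running str(df) afterwards is a quick way to confirm that all 27 rows and five columns came through as expected.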
For the sake of this exercise, let’s imagine that the amount of money a musical makes is a decent reflection of its quality (I’m sure there are many reasons why this isn’t valid, but the same could be said for a lot of operationalizations in psychology). Let’s try to find some interesting relationships between some other objective metrics and musical quality with ggplot.
A reasonable person might guess that a good musical would have a lot of music. Maybe there’s a relationship between the number of songs in a musical and how much money it made at the box office. Let’s make a basic scatterplot exploring that relationship.
At minimum, we have to specify the data and the overall aesthetics (aes), which are usually at least the x (song_quantity) and y (BoxOffice) axes. One cool thing about ggplot is that you just add layers to this backbone using a + for each layer. Markers can be added using geom_point (or geom_jitter to prevent points from overlapping too much). If you save the backbone plot (or any additions thereafter) as an object (e.g., myplot), you can simply add additional layers to that object as well (illustrated below).
# specify the backbone
myplot <- ggplot(data = df,                # specify data
                 aes(x = song_quantity,    # specify x axis variable
                     y = BoxOffice))       # specify y axis variable
# add markers to the backbone
myplot + geom_point()
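Several musicals share the same song count, so their points stack in vertical columns. Here is a small sketch of the geom_jitter variant mentioned above, using a handful of rows copied from the table so it runs on its own. The width value and the extra trend-line layer are my own choices for illustration, not part of the original analysis.

```r
library(ggplot2)

# A handful of rows from the table above, so the sketch is self-contained
df <- data.frame(
  Name = c("Gypsy", "Annie", "Flower Drum Song",
           "West Side Story", "My Fair Lady", "Paint Your Wagon"),
  song_quantity = c(16, 16, 16, 15, 15, 15),
  BoxOffice = c(133397795, 163607956, 131191774,
                533899997, 652645154, 188064853)
)

# Same backbone as above
myplot <- ggplot(data = df, aes(x = song_quantity, y = BoxOffice))

# geom_jitter() nudges each point by a small random amount;
# height = 0 keeps the box office values exact and only spreads points sideways
jittered <- myplot + geom_jitter(width = 0.2, height = 0)
jittered

# Layers keep stacking on a saved object, e.g. a straight trend line on top
jittered_trend <- jittered +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE)
jittered_trend
```

Because the jitter is random, the points land in slightly different places on every run unless you call set.seed() first.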
So there’s the visualized relationship between song quantity and musical quality: there isn’t one.
Maybe musical fans just want a good escape from their real life, so musicals that are simply longer—providing more escape—are “better”. Here’s a scatterplot exploring the relationship between runtime (length_minutes) and box office earnings (BoxOffice). I changed the point size, shape, and color just to show how.
plot2 <- ggplot(df, aes(length_minutes, BoxOffice, label = Name)) +
  geom_point(color = 'blue',       # change color of points
             shape = 'triangle',   # change point shape
             size = 3)             # change the point size
plot2
Looks like there might actually be a relationship, although it’s probably driven by that outlier, the highest-grossing musical. We could look in the dataframe to see what that is, but this is a plotting tutorial so let’s make the plot tell us. We can add text labels to the points using geom_text and nudge the labels relative to the points using vjust and hjust. Because this adds another layer that depends on the data, we have to provide the aesthetics, using the label argument in this case.
# add text labels with the musical name
plot2 + geom_text(aes(label = Name),
                  vjust = -.1)
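As a side note, a layer can be given its own data, which is one way to label a single point instead of all of them. The sketch below uses a few rows copied from the table; the top_musical helper and the vjust and hjust values are just names and numbers I picked for illustration.

```r
library(ggplot2)

# A few rows from the table above, so the sketch is self-contained
df <- data.frame(
  Name = c("Grease", "My Fair Lady", "The Sound Of Music", "Chicago", "Camelot"),
  BoxOffice = c(602892685, 652645154, 1362273686, 239112061, 218495610),
  length_minutes = c(111, 175, 174, 113, 179)
)

# Keep only the row with the largest box office figure
top_musical <- df[which.max(df$BoxOffice), ]

labelled <- ggplot(df, aes(length_minutes, BoxOffice)) +
  geom_point() +
  # this layer gets its own one-row data but inherits x and y from the backbone;
  # vjust = 1.5 puts the text below the point, hjust = 1 right-aligns it
  geom_text(data = top_musical, aes(label = Name), vjust = 1.5, hjust = 1)
labelled
```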
Most of those labels are pretty bunched up, but we can see that the highest-grossing musical is The Sound of Music. We can’t very well get rid of one of the most iconic musicals of all time, so let’s just keep the outlier for the sake of the exercise. The labels look terrible, though, so we should probably get rid of them (we can bring them back later in a pretty way). While we’re at it, let’s make the plot generally prettier by changing the theme and setting the axis labels to something clearer.
There are a number of ways to change the axis labels, scales, and tick marks, but the ones that give me the most control seem to be scale_x_continuous and scale_y_continuous.
library(scales)
plot2_revised <- plot2 +
  scale_x_continuous(name = "Runtime (in minutes)", # new x-axis label
                     # setting tick mark every 10 minutes
                     breaks = seq(60, 190, 10)) +
  scale_y_continuous(name = "Gross Box Office Earnings (USD)",
                     # changes y-axis tick labels to dollar values
                     labels = dollar,
                     breaks = seq(min(df$BoxOffice), max(df$BoxOffice), 50000000))
plot2_revised
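For comparison, one of the lighter-weight options is labs(), which only renames the axes and leaves the tick positions to ggplot's defaults. This is a rough sketch on a few rows copied from the table, not a replacement for the scale_* approach above.

```r
library(ggplot2)
library(scales)

# A few rows from the table above, so the sketch is self-contained
df <- data.frame(
  length_minutes = c(111, 175, 174, 113, 179),
  BoxOffice = c(602892685, 652645154, 1362273686, 239112061, 218495610)
)

simple_axes <- ggplot(df, aes(length_minutes, BoxOffice)) +
  geom_point() +
  # labs() changes the axis titles only; breaks stay at ggplot's defaults
  labs(x = "Runtime (in minutes)", y = "Gross Box Office Earnings (USD)") +
  # a scale is still needed if the tick labels should be formatted as dollars
  scale_y_continuous(labels = dollar)
simple_axes
```

The trade-off is control: labs() is shorter, but anything involving breaks, limits, or tick formatting still goes through the scale functions.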
There are a whole bunch of preset themes in ggplot (see full list of default themes here). I like theme_classic.
plot_final <- plot2_revised + theme_classic()
plot_final
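A preset theme can also be fine-tuned with theme(). Here is a minimal sketch, again on a few rows copied from the table, that starts from theme_classic and tilts the x-axis tick labels. The base_size and angle values are arbitrary picks for illustration.

```r
library(ggplot2)

# A few rows from the table above, so the sketch is self-contained
df <- data.frame(
  length_minutes = c(111, 175, 174, 113, 179),
  BoxOffice = c(602892685, 652645154, 1362273686, 239112061, 218495610)
)

base_plot <- ggplot(df, aes(length_minutes, BoxOffice)) + geom_point()

styled <- base_plot +
  theme_classic(base_size = 14) +                            # preset theme, larger base font
  theme(axis.text.x = element_text(angle = 45, hjust = 1))   # tilt the x tick labels
styled
```

Order matters here: the theme() call has to come after the preset theme, because adding a complete theme like theme_classic() replaces any earlier theme settings.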
I liked being able to see which dot belonged to each musical, but the labels were too ugly. Luckily there is an interactive plotting package called plotly that has a wrapper to turn any ggplot object into an interactive plot.
library(plotly)
# tooltip picks which aesthetics show on hover; "label" is the Name mapping set in plot2
int_plot <- ggplotly(plot_final, tooltip = c("label", "x", "y"))
int_plot
Now we can see the name of the musical, along with the other data, when we hover over each point. So that’s cool. I’ve played around with plotly a bit and it has much more functionality than I can go into here; it plays well enough with ggplot for stuff like this, but it looks like it might be worth just learning plotly too (maybe in a future blog post).
So that’s some basic ggplot stuff (with a bonus of ggplotly). I haven’t even scratched the surface of what’s possible with ggplot here but hopefully it helps someone get started. Just keep practicing and googling and soon you’ll be as free as Julie Andrews.
¹ In R Markdown you can programmatically write these statistics inline: open a backtick, type r and a space, then an expression such as round(mean(df$var), 2), and close the backtick.